import numpy as np
import math


def mean(list_of_numbers):
    # Arithmetic mean: sum of the values divided by their count
    return sum(list_of_numbers) / len(list_of_numbers)


def std_dev(list_of_numbers):
    # Population standard deviation (divides by N); NaN for an empty input
    if len(list_of_numbers) != 0:
        avg = mean(list_of_numbers)
        variance = sum([(i - avg)**2 for i in list_of_numbers]) / len(list_of_numbers)
        standard_dev = math.sqrt(variance)
        return standard_dev
    return np.nan
Why Standard Deviation
We often receive datasets with millions of records that must be transformed before we can build a model for a business use case. Data scientists are often said to spend about 80% of their time cleaning and transforming data and only about 20% discovering insights and developing models. That transformation work relies on a basic mathematical concept we all learned in high school: standard deviation.
Before moving on to transformation, we need to understand the data statistically: its summary features, such as mean and variance, and the distribution it follows, such as normal, uniform, or Poisson. Let’s understand S.D. through a practical application.
Understanding use-case for SD
Imagine we have a massive dataset with millions of data points, and some of these data points stand out because they have extremely high or low values. Now, the problem is that these unusual data points can mess up our graphs by making the scales look strange. What makes it tricky is that the number of these unusual data points can change. Sometimes, we might have only regular data, while other times, these strange data points can make up as much as 10% of our data.
So, the solution here is to get rid of these odd data points before we make our graphs. We can do this by ignoring any values above Mean + 2SD or below Mean - 2SD before we start plotting our data.
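The filtering rule described above can be sketched in a few lines. This is a minimal illustration, not a definitive implementation: the function name drop_outliers, the parameter k, and the sample readings are all hypothetical, and it uses the population S.D. from the standard library.

```python
import statistics


def drop_outliers(values, k=2.0):
    """Keep only values within k population standard deviations of the mean.

    `drop_outliers` and `k` are illustrative names, not from the original post.
    """
    if len(values) == 0:
        return []
    avg = statistics.mean(values)
    sd = statistics.pstdev(values)  # population S.D. (divides by N)
    lower, upper = avg - k * sd, avg + k * sd
    return [v for v in values if lower <= v <= upper]


# Hypothetical data: mostly regular readings plus one extreme value.
readings = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 500]
cleaned = drop_outliers(readings)
print(cleaned)  # the value 500 is dropped, the regular readings remain
```

Because extreme values also inflate the mean and the S.D. themselves, a single pass may not remove every unusual point in heavily contaminated data.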
Implementing in Python
To compute the S.D., we first need the mean of the numbers, so the mean function is defined first and std_dev builds on it (both appear in the code at the start of this post).
Now that we’ve defined our functions, let’s create a NumPy array of 10 random integers from 1 to 29:
list_of_numbers = np.random.randint(1, 30, 10)
print(list_of_numbers)
[ 5 27 12 5 2 20 10 8 29 23]
print(f'Standard Deviation for the list of numbers provided = {std_dev(list_of_numbers):0.2f}')
Standard Deviation for the list of numbers provided = 9.34
Using the statistics module
import statistics
list_ = [int(item) for item in list_of_numbers]
print(f'Standard Deviation using the statistics module for the list of numbers provided = {statistics.stdev(list_):0.2f}')
Standard Deviation using the statistics module for the list of numbers provided = 9.85
Note: Skipping the int(item) conversion produces the error AttributeError: ‘numpy.int64’ object has no attribute ‘bit_length’. bit_length() is a built-in method of Python’s regular int objects, but it is not available on NumPy integer types.
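As an alternative to the list comprehension, NumPy arrays provide a tolist() method that returns plain Python numbers. The sketch below assumes the same ten values printed earlier; the variable names arr and as_python_ints are illustrative.

```python
import statistics

import numpy as np

# Same ten values as the array printed earlier in the post.
arr = np.array([5, 27, 12, 5, 2, 20, 10, 8, 29, 23])

# tolist() converts NumPy integers to built-in Python ints.
as_python_ints = arr.tolist()

print(type(as_python_ints[0]))                     # <class 'int'>
print(f"{statistics.stdev(as_python_ints):0.2f}")  # 9.85
```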
Using the pandas module
One important thing to mention here is that the pandas std() method is defined on a pandas Series or DataFrame. Hence, we need to convert our NumPy array into a pandas Series.
import pandas as pd
series_of_numbers_pd = pd.Series(list_of_numbers)
print(f'Standard Deviation using the pandas for the Series of numbers provided = {series_of_numbers_pd.std() :0.2f}')
Standard Deviation using the pandas for the Series of numbers provided = 9.85
Using the numpy module
print(f'Standard Deviation using numpy for the Series of numbers provided = {np.std(list_of_numbers) :0.2f}')
Standard Deviation using numpy for the Series of numbers provided = 9.34
Using PyTorch
One important thing to mention here is that PyTorch functions require a PyTorch tensor as input, so we need to convert our list into a torch tensor. Another thing worth mentioning is that torch.std requires a floating-point tensor, so we’ll convert every item in our list to float using a list comprehension.
import torch
tensors_of_numbers = torch.tensor([float(x) for x in list_of_numbers])
tensors_of_numbers
tensor([ 5., 27., 12., 5., 2., 20., 10., 8., 29., 23.])
print(f'Standard Deviation using torch for the Series of numbers provided = {torch.std(tensors_of_numbers) :0.2f}')
Standard Deviation using torch for the Series of numbers provided = 9.85
Notice any Discrepancies?
You might have noticed that our pure Python function and the numpy function gave a result of 9.34, whereas the statistics, pandas, and torch functions gave 9.85. The reason for the difference is the denominator of the standard deviation formula. By default, statistics.stdev, pandas, and torch calculate the sample S.D., while our function and np.std calculate the population S.D. The sample standard deviation has N-1 as the denominator in its formula, while the population S.D. formula has N.
Population Standard Deviation \[\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2}\]
Sample Standard Deviation \[s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}\]
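As a worked check, both formulas can be applied by hand to the ten numbers printed earlier (5, 27, 12, 5, 2, 20, 10, 8, 29, 23). Their sum is 141, so the mean is 14.1, and the squared deviations from that mean add up to 872.9:

\[\sum_{i=1}^{10} (x_i - 14.1)^2 = 872.9\]

\[\sigma = \sqrt{\frac{872.9}{10}} \approx 9.34 \qquad s = \sqrt{\frac{872.9}{9}} \approx 9.85\]

The numerator is identical in both cases; only the denominator changes, which reproduces the two results seen above.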
Since we don’t normally know the population mean, we have to use the sample mean to calculate the variance, and this introduces bias. We estimate the population variance (\(\sigma^2\)) from the sample variance (\(s^2\)). Let the sample mean be \(\bar{x}\) and the population mean be \(\mu\). The data points \(x_1, x_2, \dots, x_n\) are, taken together, closer to \(\bar{x}\) than to \(\mu\), because \(\bar{x}\) is the value that minimizes the sum of squared deviations for that sample.
This makes \(\sum (x_i - \bar{x})^2\) smaller than (or at most equal to) \(\sum (x_i - \mu)^2\), the numerator of the formula. To compensate for this shortfall, the formula divides by \(n - 1\) instead of \(n\) when estimating the variance.
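A small simulation can make this bias visible. The sketch below is an illustration under stated assumptions, not part of the original analysis: it draws many samples of size 5 from a standard normal population (true variance 1) and averages the variance estimates obtained with each denominator. The seed, sample size, and trial count are arbitrary choices.

```python
import random

random.seed(0)  # arbitrary seed, only for reproducibility

n = 5           # sample size (assumed for illustration)
trials = 20000  # number of repeated samples (assumed for illustration)

biased_total = 0.0
unbiased_total = 0.0
for _ in range(trials):
    # Draw a sample from a population with mean 0 and variance 1.
    sample = [random.gauss(0.0, 1.0) for _ in range(n)]
    x_bar = sum(sample) / n
    sum_sq = sum((x - x_bar) ** 2 for x in sample)
    biased_total += sum_sq / n          # divide by n
    unbiased_total += sum_sq / (n - 1)  # divide by n - 1

biased_avg = biased_total / trials
unbiased_avg = unbiased_total / trials
print(f"average estimate dividing by n:     {biased_avg:0.3f}")
print(f"average estimate dividing by n - 1: {unbiased_avg:0.3f}")
```

Dividing by n should land near (n - 1)/n = 0.8 of the true variance on average, while dividing by n - 1 should land near 1.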
This is based on the concept of degrees of freedom (df): the number of independent pieces of information available for estimating the population standard deviation from a sample. Once \(\bar{x}\) is fixed, the deviations must sum to zero, so only n - 1 of them are free to vary. Hence df is “n - 1” for a sample of size “n”.
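The libraries expose this choice directly. In NumPy it is the ddof ("delta degrees of freedom") argument of np.std, which defaults to 0, and the statistics module offers pstdev alongside stdev. The sketch below uses the ten values printed earlier; the variable names are illustrative.

```python
import statistics

import numpy as np

# Same ten values as the array printed earlier in the post.
data = np.array([5, 27, 12, 5, 2, 20, 10, 8, 29, 23])
py_values = data.tolist()  # plain Python ints for the statistics module

# Population S.D.: divide by N (NumPy's default, ddof=0).
population_sd = np.std(data)
# Sample S.D.: divide by N - 1.
sample_sd = np.std(data, ddof=1)

print(f"{population_sd:0.2f} {statistics.pstdev(py_values):0.2f}")  # 9.34 9.34
print(f"{sample_sd:0.2f} {statistics.stdev(py_values):0.2f}")       # 9.85 9.85
```

pandas accepts the same ddof argument in Series.std(), where it defaults to 1, so passing ddof=0 there gives the population value.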